Assignment: VAST Mini-Challenge 2
In the roughly twenty years that Tethys-based GAStech has been operating a natural gas production site in the island country of Kronos, it has produced remarkable profits and developed strong relationships with the government of Kronos. However, GAStech has not been as successful in demonstrating environmental stewardship.
In January, 2014, the leaders of GAStech are celebrating their new-found fortune as a result of the initial public offering of their very successful company. In the midst of this celebration, several employees of GAStech go missing. An organization known as the Protectors of Kronos (POK) is suspected in the disappearance, but things may not be what they seem.
Both historical vehicle tracking data and transaction data from loyalty and credit card will be used to observe the following issues:
The data source are available publicly on VAST Challenge 2021 website under the sub section Mini-Challenge 2. The data used for the project are as follows:
Figure 1: Map of Abila, Kronos
| LastName | FirstName | BirthDate | BirthCountry | Gender |
|---|---|---|---|---|
| Bramar | Mat | 1981-12-19 | Tethys | Male |
| Ribera | Anda | 1975-11-17 | Tethys | Female |
| Pantanal | Rachel | 1984-08-22 | Tethys | Female |
| Lagos | Linda | 1980-01-26 | Tethys | Female |
| Mies Haber | Ruscella | 1964-04-26 | Kronos | Female |
| Forluniau | Carla | 1981-06-02 | Kronos | Female |
| LastName | FirstName | CarID | CurrentEmploymentType | CurrentEmploymentTitle |
|---|---|---|---|---|
| Calixto | Nils | 1 | Information Technology | IT Helpdesk |
| Azada | Lars | 2 | Engineering | Engineer |
| Balas | Felix | 3 | Engineering | Engineer |
| Barranco | Ingrid | 4 | Executive | SVP/CFO |
| Baza | Isak | 5 | Information Technology | IT Technician |
| Bergen | Linnea | 6 | Information Technology | IT Group Manager |
| timestamp | location | price | last4ccnum |
|---|---|---|---|
| 1/6/2014 7:28 | Brew’ve Been Served | 11.34 | 4795 |
| 1/6/2014 7:34 | Hallowed Grounds | 52.22 | 7108 |
| 1/6/2014 7:35 | Brew’ve Been Served | 8.33 | 6816 |
| 1/6/2014 7:36 | Hallowed Grounds | 16.72 | 9617 |
| 1/6/2014 7:37 | Brew’ve Been Served | 4.24 | 7384 |
| 1/6/2014 7:38 | Brew’ve Been Served | 4.17 | 5368 |
| Timestamp | id | lat | long |
|---|---|---|---|
| 01/06/2014 06:28:01 | 35 | 36.07623 | 24.87469 |
| 01/06/2014 06:28:01 | 35 | 36.07622 | 24.87460 |
| 01/06/2014 06:28:03 | 35 | 36.07621 | 24.87444 |
| 01/06/2014 06:28:05 | 35 | 36.07622 | 24.87425 |
| 01/06/2014 06:28:06 | 35 | 36.07621 | 24.87417 |
| 01/06/2014 06:28:07 | 35 | 36.07619 | 24.87406 |
| timestamp | location | price | loyaltynum |
|---|---|---|---|
| 01/06/2014 | Brew’ve Been Served | 4.17 | L2247 |
| 01/06/2014 | Brew’ve Been Served | 9.60 | L9406 |
| 01/06/2014 | Hallowed Grounds | 16.53 | L8328 |
| 01/06/2014 | Coffee Shack | 11.51 | L6417 |
| 01/06/2014 | Hallowed Grounds | 12.93 | L1107 |
| 01/06/2014 | Brew’ve Been Served | 4.27 | L4034 |
4.1 Past MITB Visual Analytics project were reviewed and evaluated prior to the assignment.
4.2 The solutions submitted for VAST challenge 2014 were also reviewed on their repository webpage(“VAST Challenge 2014:MC2 - Patterns of Life Analysis” 2014).
Submission entry from the University of Buenos Aires - Tralice (Villordo et al. 2014) included a multi-layered horizontal bar graph that showed the GPS movement for each employment type. The background highlight to indicate the weekend provided a good contrast and representation for the differentiation between weekday and weekends.
Submission entry from KU Leuven (Chua et al. 2014) used a boxplot to visualise the credit card spending price at each location. Boxplot allows for distinct and clear visualisation of outliers in the transaction price. However, boxplot also provides informative details such as the median, 25 and 75 percentile price for each location which was not reflected in the boxplot. Furthermore, the 10,000 dollars outliers caused the y-axis tick marks to be large and each individual boxplot became too small on the plot.
Submission entry from the University of Calgary (Sahaf et al. 2014) utilised parallel coordinate plot to show the interaction and relationship between different categorical and numerical variables. The visualisation provides story telling insights between the different variables.
Most past submission includes the map overlayed with the GPS lines and points to show the movement of each car. I would like to highlight the submission from Central South University (Zhao et al. 2014) where the map utilise different colors for lines and dots to present their findings. The variation in colors allowed for better visualisation and clarity of the different parts to be highlighted to gain insights. However, due to the overlap of GPS data, an interactive map with tooltip will allow better interpretation of the findings.
Submission from Fraunhofer IAIS and City University London (Andrienko, Andrienko, and Fuchs 2014) and RBEI-Bangalore (Singhal et al. 2014) both used network cluster and analysis to investigate the relationships between GAStech employees. Fraunhofer IAIS university used an ego-centric graph together whereas RBEI used a combination of fragmented and node-only layout to visual the relationship by connecting each employees. The network analysis is an informative visualisation to provide an overview of potential relationships between employees or even connecting employees to different mediums such as the locations or emails.
Submission from University of Buenos Aires - Alcoser (Flores, Lopez, and Forero 2014) used the sankey diagrams to visualise the locations where employees frequently visits. Sankey diagram shows how the quantities flow from one state to another and is usually used to show flows or processes.
Submission from University of Bueons Aires - Croceri (Croceri and Guzzi 2014) used a scatter plot to show the distance average speed against the speed for each employees route. The visualisation displayed extreme outliers effectively based on the car speed.
1. Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?
The following packages are loaded for data preparation and visualisation.
packages = c('tidyverse', 'lubridate', 'MASS',
'ggplot2', 'cdparcoord', 'ggiraph', 'plotly',
'geosphere', 'sf','rgeos', 'crosstalk',
'raster', 'tmap','visNetwork','ggraph','tidygraph',
'ggalluvial')
for(p in packages){
if(!require(p, character.only=T)){
install.packages(p)
}
library(p, character.only=T)
}
The credit card and loyalty card datasets were loaded and the structure was checked.
glimpse(cc)
Rows: 1,490
Columns: 4
$ timestamp <chr> "1/6/2014 7:28", "1/6/2014 7:34", "1/6/2014 7:35"~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
glimpse(loyalty)
Rows: 1,392
Columns: 4
$ timestamp <chr> "01/06/2014", "01/06/2014", "01/06/2014", "01/06/~
$ location <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
Customer would usually use credit card (cc) with their loyalty card, hence joining both data allows the tagging of cc to loyalty card number. A suitable left join on CC data with loyalty data using timestamp, location and price will be performed. However, both timestamp field are in character format instead of datetime format. The following adjustment will be performed:
## 1. Create column "datetime" in datetime format "YYYY-dd-mm HH:MM:SS"
## 2. Create column "date" in date format "YYYY-dd-mm"
## 3. Change encoding of locations name
cc <- as_tibble(lapply(cc, iconv, to="ASCII//TRANSLIT"))
cc <- cc %>% mutate(datetime = mdy_hm(timestamp), date = date(datetime),
price = as.numeric(price), last4ccnum=as.factor(last4ccnum))
## 1. Create column "date" in date format "YYYY-dd-mm"
## 2. Change encoding of locations name
loyalty <- as_tibble(lapply(loyalty, iconv, to="ASCII//TRANSLIT"))
loyalty <- loyalty %>% mutate(date = date(mdy(timestamp)), price=as.numeric(price))
glimpse(cc)
Rows: 1,490
Columns: 6
$ timestamp <chr> "1/6/2014 7:28", "1/6/2014 7:34", "1/6/2014 7:35"~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ datetime <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ date <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
glimpse(loyalty)
Rows: 1,392
Columns: 5
$ timestamp <chr> "01/06/2014", "01/06/2014", "01/06/2014", "01/06/~
$ location <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
$ date <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
Prior to joining both data, a quick glance of the aggregated summary statistics in table 1 showed that there are more credit card transaction as compared to loyalty card transaction for each day. This could imply that employees did not use their loyalty card when they perform a transaction with their credit card and a perfect join of the two dataset was not possible. A left join of cc and loyalty dataset by location, date and price was performed.
## Summary statistics for cc and loyalty transaction per day
cc_t<-merge((cc %>% group_by(date) %>% summarize(cc_count = n())),
(loyalty %>% group_by(date) %>% summarize(loyalty_count = n())),
by="date") %>% mutate(diff = cc_count-loyalty_count)
knitr::kable(cc_t, "simple",
caption="Summary statistics for cc and loyalty transaction per day")
| date | cc_count | loyalty_count | diff |
|---|---|---|---|
| 2014-01-06 | 128 | 119 | 9 |
| 2014-01-07 | 130 | 122 | 8 |
| 2014-01-08 | 129 | 122 | 7 |
| 2014-01-09 | 133 | 118 | 15 |
| 2014-01-10 | 116 | 103 | 13 |
| 2014-01-11 | 61 | 51 | 10 |
| 2014-01-12 | 55 | 54 | 1 |
| 2014-01-13 | 121 | 117 | 4 |
| 2014-01-14 | 128 | 123 | 5 |
| 2014-01-15 | 126 | 122 | 4 |
| 2014-01-16 | 131 | 123 | 8 |
| 2014-01-17 | 113 | 108 | 5 |
| 2014-01-18 | 70 | 67 | 3 |
| 2014-01-19 | 49 | 43 | 6 |
## Left join cc with loyalty data
trans <- left_join(cc, loyalty, by=c("location", "date", "price")) %>%
dplyr::select(-c(timestamp.x, timestamp.y, datetime))
glimpse(trans)
Rows: 1,496
Columns: 5
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ date <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ loyaltynum <chr> "L8566", NA, "L8148", "L5553", "L3800", "L2247", ~
The trans data mostly tagged a unique “last4ccnum” to a unique “loyaltynum.” However, the number of rows increase from 1490 to 1496, implying that multiple matches occur. It is most likely because there were 6 transaction in the loyalty data with the same location, date and price value from different loyaltynum card.
To investigate the multiple tagging of each unique cc number or unique loyalty card number, the data was transformed and visualise using an interactive parallel coordinate graph in Figure 2. Clicking on either vertical axis “last4ccnum” or “loyaltynum” highlights only the matching lines.
bind_rows(
trans %>% na.omit() %>%
group_by(last4ccnum)%>% filter(n_distinct(loyaltynum)>1),
trans %>% na.omit() %>%
group_by(loyaltynum) %>% filter(n_distinct(last4ccnum)>1)
) %>% distinct() %>% mutate(last4ccnum = as.character(last4ccnum)) %>%
dplyr::select(last4ccnum,loyaltynum) %>%
discparcoord(k=1000,
interactive=TRUE,
name="Multiple tags of CC and loyalty number")
Figure 2: Parallel Coordinate plot of CC with multiple tags to Loyalty card number
Selecting credit card number ending 8332, 7889, 5921, 5368, 4948 and 4795 revealed that those credit card were tagged to two different unique loyalty card number and one of them has low transaction count which was represented by the dark brown line. Drilling down on the 6 credit card numbers in the trans data, the matching row had only 1 transactions. This imply that there were two loyalty card transactions that recorded the same date, location and price, resulting in a one to many join that fulfilled all conditions. Hence, these 6 rows of transaction were the difference in row count from the original cc data and the trans data.
Credit card number 1286 was tagged to loyalty number L3288 and L3572 with 15 and 13 transactions respectively. On the other hand, loyalty number L3288 is also tagged to a unique cc number 9241 with 13 transactions. A possible deduction would be the owner of cc 9241 loyalty card is L3288 and owner of cc 1286 loyalty card is L3572. However, the owner of cc 1286 often paid and use L3288 loyalty card. This could suggest close relationship between owners of cc 1286 and 9241.
Loyalty number L6267 wass tagged to cc number 6899 and 6691 with 23 and 20 transaction respectively. On the other hand, both cc 6899 and 6691 had only one unique tag to the loyalty card. Possible deduction could be that the owner of credit card number 6899 and 6691 is the same person using loyalty card L6267. Another deduction would be loyalty number L6267 is shared among the owners of cc 6899 and 6691. If the latter deduction is correct, this could suggest close relationship between owners of cc 6899 and 6691.
With these information, a new dataset card_tag was created to tag the owners of their cc and loyalty card numbers together. However, there were 409 transactions in dataset trans that were not tagged.
The 409 cc transactions that were not tagged were analysed by mapping the cc and loyalty card. Thereafter, a left join of non-tagged transactions to the loyalty data by field “date,” “location” and loyaltynum" was performed. From Figure 3, it was observed that most of the difference in cc card price and loyalty price converges to “20”, “40”, “60” and “80” dollars. A possible deduction based on the price difference in denomination of “20” could suggest some form of discount or rebate mentioned in the background.
A deliberate shortfall is not possible as those transactions were evenly spread across different days and locations. Furthermore, as the occurrence of the difference in price exist for multiple cc and loyalty card, it was not possible that the shortfall were targeted towards specific owners or at specific locations.
## Non matching cc and loyalty card transaction
non_match_cc <- anti_join(cc, (trans %>% na.omit())) %>% left_join(card_tag)
## Non matching loyalty card and cc transaction
non_match_loy <- anti_join(loyalty, (trans%>%na.omit()))
## All non matching transaction
non_match_trans <- left_join(non_match_cc,
non_match_loy,
by=c("location", "date", "loyaltynum" )) %>%
na.omit() %>%
mutate(diff=price.x-price.y) %>%
filter(diff>=0)
## Remove outliers, select columns and visualise using parallel coordinate plot
non_match_trans %>%
filter(!(diff==boxplot(non_match_trans$diff, plot = FALSE)$out)) %>%
dplyr::select(last4ccnum,loyaltynum,location,price.x,price.y, diff) %>%
rename(price_cc = price.x, price_loyalty = price.y) %>%
mutate(last4ccnum = as.character(last4ccnum)) %>%
discparcoord(k=1000,
interactive=TRUE,
name="Non-matching transactions by cc and loyalty number")
Figure 3: Parallel Coordinate plot of CC to Loyalty card number with discount
There was a subset of cc transactions that are not tagged to any loyalty card transactions. Possible deductions could be that owners forgot their loyalty card when making the transactions or there might be suspicious activities in these transactions where owners deliberately avoided using their loyalty card. This subset of transactions was visualise with a boxplot in Figure 4. The boxplot displayed one extreme outlier at Frydos Autosupply n’ More. Hovering over the red outlier circle indicates that the owner of cc 9551 spent 10,000 dollars in that transaction whereas the median price is 134.9 at Frydos Autosupply n’ More. This transaction was extremely suspicious because of the extreme outlier spending and the owner did not use his/her loyalty card despite being such a high amount transactional value.
## Transactions match equally from cc and loyalty card
match_cc <- left_join((left_join(cc, card_tag)),
loyalty, by=c("location","date","price")) %>%
na.omit() %>%
group_by(last4ccnum, loyaltynum.y) %>% filter(n()>1) %>%
dplyr::select(-(timestamp.y)) %>%
rename(timestamp = timestamp.x,
loyaltynum_owner = loyaltynum.x,
loyaltynum_trans = loyaltynum.y) %>%
mutate(trans_match = 1)
## Transactions match with difference in 20 dollars denomination
match_cc_dis <- anti_join(cc, match_cc, by=c("date","location","price")) %>%
left_join((non_match_trans %>% filter(diff %in% c(20, 40, 60, 80))),
by=c("location", "last4ccnum","date","price"="price.x")) %>%
na.omit() %>%
dplyr::select(-timestamp.x, -datetime.y, -timestamp.y) %>%
rename(datetime = datetime.x,
loyaltynum_trans = loyaltynum,
price_loy = price.y) %>%
mutate(trans_match = 1)
## Transactions with cc transactions but not match to loyalty card
no_loy_trans <- anti_join(cc, match_cc, by=c("date","location","price")) %>%
anti_join(match_cc_dis, by=c("date","location","price")) %>%
mutate(trans_match = 0)
## Tagging all information on transactions from cc and loyalty to final_trans
final_trans <- bind_rows(match_cc, match_cc_dis, no_loy_trans)
## Determine median price per location
median_price <- no_loy_trans %>%
group_by(location) %>%
summarize(med=median(price))
## Data transformation for boxplot plotting
no_loy_trans_1 <- no_loy_trans %>%
left_join(median_price, by=c("location"))
## Boxplot function
boxplot1 <- ggplot(no_loy_trans_1, aes(x=location, y=price, text=paste("Median:", med))) +
geom_boxplot(outlier.color="red",outlier.fill="red") +
geom_point(alpha=0) + scale_y_log10() + coord_flip() +
ggtitle("Boxplot of CC transaction NOT tagged to loyalty card") +
theme(axis.title=element_blank(),
plot.title=element_text(size=16, face="bold")) +
xlab("Price")
boxplot_p1<-ggplotly(boxplot1, width_svg = 7, height_svg = 7)
boxplot_p1$x$data[[1]]$hoverinfo <- "none"
# overrides black outline of outliers
boxplot_p1$x$data[[1]]$marker$line$color = "red"
# overrides black extreme outlier color
boxplot_p1$x$data[[1]]$marker$outliercolor = "red"
# overrides black not as extreme outlier color
boxplot_p1$x$data[[1]]$marker$color = "red"
boxplot_p1
Figure 4: Boxplot of cc transaction without loyalty card
To determine the most popular location in Abila, the visualisation in Figure 5 shows the frequency of the transactions and the transaction prices for each location. The plot Number of transactions per day by location shows which location had the highest number of transaction each day separated by time period and the weekends are shaded in grey. The plot Boxplot of transaction prices per location shows the prices for each location. Log transformation was performed on the boxplot x-axis(Price). The following insights are inferred from the plot.
1. Transactions occurring only on weekdays morning.
The 3 location seems to be coffee shops based on their location name or logo and Brew’ve Been Served is the most popular location among them. Based on the locations, price and timestamp of the transactions, a possible deduction would be these coffee shops serves take-out coffee and are located in between employees home and GAStech. The median price of each transactions were similar for all 3 locations at around 12 dollars. From the map, Coffee Cameleon is the nearest to GAStech but Brew’ve Been Served has more transactions. making Brew’ve Been Served the most popular morning coffee take-out choice among the employees.
2. Transactions occurring only on weekdays afternoon.
Based on the location name or logo, these 4 location seems to be food and beverage outlets. The median price for these locations range from 12 to 15 dollars. A possible deductions could be these location only operates on weekday lunch time and serves drinks such as coffee as they have similar price range as the take-out coffee mentioned previously.
3. Transactions occurring daily during the afternoon or night period.
The 6 locations has transactions from both afternoon and night time period on all days with a median price of 28 to 32 dollars. A possible deduction based on the location names, logo and transaction trend indicates that these are also food and beverage outlets. However, the higher median price and frequent transaction during both afternoon and night period might suggest that these are restaurants that serves full meals for lunch and dinner.
4. Higher value transactions on weekdays only.
These locations has higher median price compared to the others. The company name and logo suggests that the locations are customer or supplier of GAStech. As the bulk of transaction are on the weekday, a possible deduction would be these locations are related to work. The higher median price value could be due to the purchase raw materials which translate to much higher price transacted on weekdays only.
5. Suspicious transaction.
In the boxplot, there is an extreme outlier of a 10,000 dollars while the median price was only 149 dollars. This particular transaction was flagged out in our previous analysis of cc transaction that were not tagged to loyalty card. As individuals are more likely to use loyalty card in conjunction with the loyalty card, the scenario for this transaction further exacerbated the suspicion.
There were frequent transactions performed at Kronos Mart during the midnight period on Monday and both Sundays. The 5 transactions in during midnight is not common and it only occurs only at one specific location. These 5 transactions performed were not tagged to a loyalty card as well. This raises suspicion on the cc owner.
In the boxplot, there was an extreme outlier of 1,239.41 dollars while the median was only 211.47 dollars. It was six times the median price which might be a suspicious transactions. However, looking at the frequency of transactions at Albert’s Fine Clothing, it seems like a common place to buy clothing. Possible deduction was the person was buying lots of clothing for his family or friends, amounting to a much higher price than usual.
## Data manipulation to add more factors
final_trans_1 <- final_trans %>% ungroup() %>%
mutate(day = as.factor(wday(date)),
wkday = ifelse(day == "6" | day =="7", "weekend", "weekday"),
time_bin = case_when(
hour(datetime)>=0 & hour(datetime)<6 ~ "Midnight",
hour(datetime)>=6 & hour(datetime)<12 ~ "Morning",
hour(datetime)>=12 & hour(datetime) <18 ~ "Afternoon",
hour(datetime)>=18 ~ "Night"),
time_bin = factor(time_bin,
levels = c("Midnight", "Morning", "Afternoon", "Night"))
)
## Data transformation to plot Bar graph for transaction frequency
freq<- final_trans_1 %>%
group_by(location, date, time_bin) %>% summarize(co=n())
freq_location <- ggplot(freq, aes(x=date, y=co, fill=time_bin,
tooltip= paste(co, " transactions at ",location, " on ", date, time_bin))) +
geom_col_interactive() +
annotate(geom="rect", xmin=ymd(20140111)-.5, xmax=ymd(20140113)-.5,
ymin=-Inf, ymax=Inf, fill='dark grey' , alpha=0.5) +
annotate(geom="rect", xmin=ymd(20140118)-.5, xmax=ymd(20140120)-.5,
ymin=-Inf, ymax=Inf, fill='dark grey' , alpha=0.5) +
facet_wrap(~location) +
ggtitle("Number of transactions per day by location") +
xlab("Date") + ylab("Number of transactions") +
labs(fill="Time period") +
theme(plot.title=element_text(size=20,face="bold"),
axis.title=element_text(size=14,face="bold"),
strip.text = element_text(size = 6),
axis.text=element_text(size=6),
axis.text.x=element_text(angle=45, hjust=1),
legend.position="bottom")
# Find median price per location
median_price_final <- final_trans_1 %>%
group_by(location) %>%
summarize(med=median(price))
## Data transformation for boxplot plotting
final_trans_1 <- final_trans_1 %>%
left_join(median_price_final, by=c("location"))
## Boxplot plotting
boxplot <- ggplot(final_trans_1, aes(x=location, y=price, text=paste("Median:", med))) +
geom_boxplot(outlier.color="red",outlier.fill="red") +
geom_point(alpha=0) + scale_y_log10() + coord_flip() +
ggtitle("Boxplot of transaction prices per location") +
theme(axis.title=element_blank(),
plot.title=element_text(size=20, face="bold"))
boxplot_p<-ggplotly(boxplot)
boxplot_p$x$data[[1]]$hoverinfo <- "none"
# overrides black outline of outliers
boxplot_p$x$data[[1]]$marker$line$color = "red"
# overrides black extreme outlier color
boxplot_p$x$data[[1]]$marker$outliercolor = "red"
# overrides black not as extreme outlier color
boxplot_p$x$data[[1]]$marker$color = "red"
## Plot Interactive Bar chart and Boxplot
girafe(ggobj=freq_location, width_svg = 7, height_svg = 7)
Figure 5: Visualize transactions history
boxplot_p
Figure 5: Visualize transactions history
2. Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?
The GPS dataset has rows of GPS coordinates that were logged every few seconds. This signifies that the car was moving and logging different GPS coordinates. The data was transformed to only keep the stationary GPS coordinate for each car by determining rows where the time lag between subsequent GPS log by each car id was more than 5 minutes. 5 minutes was selected because the waiting time at a traffic lights is around 3 to 5 minutes hence the upper bound was chosen to eliminate situations where the stationary GPS coordinates were due stoppage at traffic lights.
2.1 The first anomaly to be investigated is the high transaction price of 10,000 dollars performed at Frydos Autosupply n’ More on 13/01/2014 night by cc 9951.
Based on the location name and logo, it is highly likely to be a mechanic repair shop for vehicle. The transaction without a matching loyalty card transaction made it more suspicious. The transaction records for cc 9951 was extracted and observed for 13/01/2014 in table 2.
There were 5 transactions made and 3 of them did not match the loyalty card transaction data. This eliminates the possibility of the owner forgetting to bring his/her loyalty card for that particular day. There were two transactions made with a time difference of 10 minutes and one of them did not use the loyalty card during both afternoon and night time period each. To further analyse the transactions, the gps log data was visualise on Abila map.
## Transactions on 13/01/2014 at "Frydos Autosupply n' More"
knitr::kable(final_trans_1 %>%
filter(last4ccnum==9551 & date == dmy(13012014)) %>%
dplyr::select(datetime,location,price,last4ccnum,trans_match)%>%
arrange(datetime), "simple",
caption="Table of transaction for cc 9951 on 13/01/2014")
| datetime | location | price | last4ccnum | trans_match |
|---|---|---|---|---|
| 2014-01-13 06:04:00 | Daily Dealz | 2.01 | 9551 | 0 |
| 2014-01-13 13:18:00 | U-Pump | 55.25 | 9551 | 0 |
| 2014-01-13 13:28:00 | Hippokampos | 30.51 | 9551 | 1 |
| 2014-01-13 19:20:00 | Frydos Autosupply n’ More | 10000.00 | 9551 | 0 |
| 2014-01-13 19:30:00 | Ouzeri Elian | 28.75 | 9551 | 1 |
Figure 6 shows all cars GPS travel lines for 13/01/2014. From Figure 5 frequency plot for each location, we observe that there was only 2 transaction performed at U-Pump throughout the 2 weeks data. Hovering around the GPS lines right on top of U-Pump reveals that only car id 24 visited the location. Since U-Pump is a petrol kiosk, we can confidently say that car id 24 owner used cc 9951 to make a transaction at “U-Pump.”
Car id 24 GPS line was marked in red and the stationary GPS coordinates were marked as blue dots. These blue dots will represent the GPS coordinates where the car was stationary at the particular location.
Hovering over the blue dot near U-Pump on the map shows the car stopping at 12:35:15 and leaving at 13:22:01. This matches the transaction performed at U-Pump at 13:18:00.
Thereafter, the car left U-Pump at 13:22:01 and arrived back in GAStech at 13:27:14. Hence, the transaction at 13:28:00 at Hippokampos was not possible.
In the evening, the car GPS showed that it left GAStech at 17:57:01 and stop around Ipsilantou Avenue at 18:00:31 and subsequently drove off at 19:29:01. The 10,000 dollars transaction at Frydos Autosupply n’ More was performed at 19:20:00 which fits the car gps timeline. Although the car did not stop directly at Frydos Autosupply n’ More, the distance is around 500 metres and it is possible for the owner to walk on foot to make the 10,000 dollars transactions.
Thereafter, the car started driving at 19:29:01 to the north and stop at 19:31:35. This eliminates the possibility of the transaction at 19:30:00 at Ouzeri Elian.
The combination of transaction data of cc 9551 records with car id 24 does not fit perfectly. An observation of the two possible transaction made on cc 9551 by car id 24 owner did not have a loyalty card transaction record matched. Similarly, the other two impossible transactions were both matched to a loyalty card transaction. The trend further confirmed that the transactions made on cc 9551 is extremely suspicious. Probable deduction would be cc 9551 does not belong to car id 24 while the real owner of cc 9551 was someone else who used it during the day too.
## Load Map and SHP file
bgmap <- raster("datasets/MC2-tourist.tif")
## Transform the structure of GPS data for Map
gps <- gps %>% mutate(timestamp=mdy_hms(Timestamp),id=as_factor(id))
gps_sf <- st_as_sf(gps, coords=c("long","lat"), crs=4326)
gps_stop <- gps_sf %>% group_by(id) %>% arrange(timestamp) %>%
mutate(start_diff= as.numeric(timestamp - lag(timestamp,default=first(timestamp)))/60,
stop_diff= as.numeric(lead(timestamp)-timestamp)/60,
date = as.Date(timestamp)) %>%
filter(start_diff>5 | stop_diff >5) %>%
mutate(start_vec=ifelse(start_diff>5,1,0), stop_vec=ifelse(stop_diff>5,1,0))
## Convert to LINE string for 13/01/2014
gps_path_all <- gps_sf %>%
filter(as.Date(gps_sf$timestamp) == dmy(13012014)) %>%
group_by(id) %>%
summarize(m = mean(timestamp), do_union=FALSE) %>% st_cast("LINESTRING")
## Convert to LINE string for 13/01/2014 and car id 24
gps_path_24 <- gps_sf %>%
filter(as.Date(gps_sf$timestamp) == dmy(13012014), id==24) %>%
group_by(id) %>%
summarize(m = mean(timestamp), do_union=FALSE) %>% st_cast("LINESTRING")
## Filter stop points for 13/01/2014 and car id 24
gps_24_points <- gps_stop %>% filter(id ==24 & date == dmy(13012014))
## Plot interactive map
tmap_mode("view")
map1<-tm_shape(bgmap) +
tm_rgb(bgmap, r=1, g=2, b=3, alpha=NA, saturation=1,
interpolate=TRUE, max.value=255) +
tm_shape(gps_path_all)+
tm_lines() +
tm_shape(gps_path_24) +
tm_lines(col ="red") +
tm_shape(gps_24_points)+
tm_dots(col="blue", shape=30)
tmap_leaflet(map1)
Figure 6: GPS data for 13/01/2014
2.2 The second anomaly were the early morning transactions records at Kronos Mart from Figure 5 frequency plot. Table 3 below displays all transactions records at Kronos Mart. Five out of the ten transactions were performed in the wee hours around 3am on three different days and three out of the five occurred on 19/01/2014. These few transactions were particularly unusual and further investigation was conducted.
## Transactions on 13/01/2014 at "Frydos Autosupply n' More"
knitr::kable(final_trans_1 %>%
filter(location == "Kronos Mart") %>%
dplyr::select(datetime,location,price,last4ccnum,trans_match)%>%
arrange(datetime), "simple",
caption="Table of transaction for cc 9951 on 13/01/2014")
| datetime | location | price | last4ccnum | trans_match |
|---|---|---|---|---|
| 2014-01-10 09:30:00 | Kronos Mart | 203.91 | 7688 | 0 |
| 2014-01-12 03:39:00 | Kronos Mart | 277.26 | 8156 | 0 |
| 2014-01-13 03:00:00 | Kronos Mart | 147.30 | 5407 | 0 |
| 2014-01-13 08:01:00 | Kronos Mart | 159.06 | 6816 | 0 |
| 2014-01-14 08:20:00 | Kronos Mart | 58.85 | 6899 | 0 |
| 2014-01-16 07:30:00 | Kronos Mart | 298.83 | 7108 | 0 |
| 2014-01-17 08:08:00 | Kronos Mart | 286.24 | 1415 | 0 |
| 2014-01-19 03:13:00 | Kronos Mart | 87.66 | 3484 | 0 |
| 2014-01-19 03:45:00 | Kronos Mart | 194.51 | 9551 | 0 |
| 2014-01-19 03:48:00 | Kronos Mart | 150.36 | 8332 | 0 |
The GPS records for 19/01/2014 were visualised to investigate the transactions. From Figure 7, there was no GPS data that passed by nor stop in the vicinity of Kronos Mart on 19/01/2014. The closest stop location was at ROBERTS AND SONS at 13:20:06 to 14:23:01 by car id 30 represented by the blue dot. The timing of the transaction does not coincide with the cc transaction timing.
Hence, possible deduction could be that cc owners of 3484, 9551 and 8332 stays within walking distance to Kronos Mart, therefore eliminating the need to drive their employee car to the location. Another possibility is that the owners of the cc used their own personal vehicles to get there, resulting in no GPS record for employees issued vehicles. Coincidentally, cc 9551 also appeared in these transaction, which warrants additional investigation.
## Map geometry for 19012014
gps_path2 <- gps_sf %>%
filter(as.Date(gps_sf$timestamp) == dmy(19012014) & id !=29) %>%
group_by(id) %>%
summarize(m = mean(timestamp), do_union=FALSE) %>% st_cast("LINESTRING")
gps_points2 <- gps_stop %>% filter(date == dmy(19012014))
## Plot interactive map
tmap_mode("view")
map2<-tm_shape(bgmap) +
tm_rgb(bgmap, r=1, g=2, b=3, alpha=NA, saturation=1,
interpolate=TRUE, max.value=255) +
tm_shape(gps_path2)+
tm_lines() +
tm_shape(gps_points2)+
tm_dots(col="blue", shape=30)
tmap_leaflet(map2)
Figure 7: GPS data for 19/01/2014
2.3 Lastly, we will cross-check and validate the GPS data with the frequency of transactions at each location. We will first validate the weekday movement. From the earlier section, there were three groups of transaction data: weekday morning transaction only, weekday afternoon transactions only and high value transactions on weekdays only. The map with GPS movement on 07/01/2014 was visualise in Figure 8 as there were transactions performed on that day at all of the locations.
The car GPS stationary coordinates in blue dots for Coffee Cameleon and Hallowed Grounds fits the transaction data. However, the blue dots directly on Brew’ve Been Served logo in the map shows that the timing of the stationary coordinates were mainly in the afternoon or evening. This does not match the transaction timing at Brew’ve Been Served. However, looking slightly south near the main road of Ipsilantou Avenue, there were multiple GPS stationary coordinates in the morning and they fit the transaction timing at Brew’ve Been Served. This might be due to the misrepresentation of the location logo on the map.
There are 4 locations that are in this group. Based on the 4 locations name and logo, they seems to be similar to the earlier group consisting of coffee shops.
Table 4 shows the 13 transactions at the 4 locations on 07/01/2014. A common trend observed was the exact same timestamp of 12:00 on all 13 transactions. However, looking at the GPS stationary positions at those location, the GPS stationary coordinates timestamp were in the morning before 09:00 where employees would presumably visit before heading to GAStech for work.
The 4 locations were spread around Abila, and the occurrence of mismatch GPS stationary timestamp were consistent. A possible deduction could be due to faulty Point of Sales (POS) machines at those locations. Alternatively, it might be possible that they are using the same type of POS machine that performed batch processing instead of real-time processing for cc transactions which process at 12:00 daily.
## Transactions on 13/01/2014 at "Frydos Autosupply n' More"
knitr::kable(final_trans_1 %>%
filter((location == "Jack's Magical Beans" |
location == "Brewed Awakenings" |
location == "Coffee Shack" |
location == "Bean There Done That") &
date == dmy(07012014)) %>%
dplyr::select(datetime,location,price,last4ccnum,trans_match, price_loy)%>%
arrange(datetime), "simple",
caption="Table of transaction the 4 locations on 07/01/2014")
| datetime | location | price | last4ccnum | trans_match | price_loy |
|---|---|---|---|---|---|
| 2014-01-07 12:00:00 | Coffee Shack | 16.63 | 7117 | 1 | NA |
| 2014-01-07 12:00:00 | Brewed Awakenings | 6.72 | 8332 | 1 | NA |
| 2014-01-07 12:00:00 | Bean There Done That | 8.03 | 1321 | 1 | NA |
| 2014-01-07 12:00:00 | Jack’s Magical Beans | 18.77 | 9241 | 1 | NA |
| 2014-01-07 12:00:00 | Jack’s Magical Beans | 19.61 | 8156 | 1 | NA |
| 2014-01-07 12:00:00 | Bean There Done That | 51.25 | 1415 | 1 | 11.25 |
| 2014-01-07 12:00:00 | Jack’s Magical Beans | 23.68 | 6899 | 1 | 3.68 |
| 2014-01-07 12:00:00 | Brewed Awakenings | 64.84 | 3853 | 1 | 4.84 |
| 2014-01-07 12:00:00 | Brewed Awakenings | 71.59 | 2540 | 1 | 11.59 |
| 2014-01-07 12:00:00 | Bean There Done That | 53.89 | 1877 | 1 | 13.89 |
| 2014-01-07 12:00:00 | Bean There Done That | 46.25 | 6895 | 1 | 6.25 |
| 2014-01-07 12:00:00 | Jack’s Magical Beans | 69.84 | 2463 | 1 | 9.84 |
| 2014-01-07 12:00:00 | Brewed Awakenings | 12.17 | 7688 | 0 | NA |
Based on the 7 locations name and logo, they are likely to be industrial Places of Interest. Observation from the stationary GPS represented by the blue dots at these locations revealed that only truck drivers with car id 100 and above visited these locations. The stationary GPS timestamp also matches the cc transaction timestamp. Hence, a possible deduction is that these 7 locations are businesses that are close partners with GAStech and the payment were made by the lorry truck driver during the weekdays. This will align with the fact that lorry driver vehicles only operates on weekday working hours.
## Map geometry for 07012014
gps_path3 <- gps_sf %>%
filter(as.Date(gps_sf$timestamp) == dmy(07012014)) %>%
group_by(id) %>%
summarize(m = mean(timestamp), do_union=FALSE) %>% st_cast("LINESTRING")
gps_points3 <- gps_stop %>% filter(date == dmy(07012014))
## Plot interactive map
tmap_mode("view")
map3<-tm_shape(bgmap) +
tm_rgb(bgmap, r=1, g=2, b=3, alpha=NA, saturation=1,
interpolate=TRUE, max.value=255) +
tm_shape(gps_path3)+
tm_lines() +
tm_shape(gps_points3)+
tm_dots(col="blue", shape=30)
tmap_leaflet(map3)
Figure 8: GPS data for 07/01/2014
3. Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data? Please limit your answer to 8 images and 500 words.
In order to tag the owners of each credit card and loyalty card to the car id, we would need to combine several factors together to triangulate the results. The two conditions that will be used to triangulate the data between the three datasets are:
The locations coordinates would be assigned by referencing the tourist map of Abila. However, from the earlier section, we discovered that the tourist map provided might not be accurate in locating the location coordinates as the icons on the tourist map might not represent the exact coordinates of the location.
Furthermore, the tourist map do not have all the locations marked by its logo which will not allows a full join with the locations in the cc transaction data. Table 5 shows the locations from the cc dataset whose logo could not be located visually on the tourist map of Abila. Ranking the number of transaction at each location in descending order, there are high volume of transactions at those locations and the need to map their GPS coordinate is necessary.
## Transactions on 13/01/2014 at "Frydos Autosupply n' More"
locations <- data.frame(location = cc$location) %>%
group_by(location) %>% summarize(number_transactions=n())
knitr::kable(locations %>%
dplyr::filter(location == "Abila Zacharo" |
location == "Brewed Awakenings" |
location == "Daily Dealz" |
location == "Hippokampos" |
location == "Kalami Kafenion" |
location == "Kronos Pipe and Irrigation" |
location == "Octavio's Office Supplies" |
location == "Shoppers' Delight" |
location == "Stewart and Sons Fabrication") %>%
arrange(desc(number_transactions)), "simple",
caption="Table of location with no traceable coordinates")
| location | number_transactions |
|---|---|
| Hippokampos | 171 |
| Abila Zacharo | 72 |
| Kalami Kafenion | 64 |
| Brewed Awakenings | 30 |
| Shoppers’ Delight | 20 |
| Stewart and Sons Fabrication | 18 |
| Kronos Pipe and Irrigation | 6 |
| Octavio’s Office Supplies | 4 |
| Daily Dealz | 1 |
Figure 9 shows the map marked with blue dots representing the stationary GPS coordinate of all the cars except for each employee house. The popular locations can be determined by the frequency of the blue dots at a particular location on the map.
Cross referencing with the transactions table, the locations coordinates were tag with their corresponding coordinates by cross-referencing to the car GPS data and geo-referenced data.
## Getting coordinates of car stop positions
first_gps <- gps_stop %>%
group_by(id) %>%
filter(row_number()==1) %>%
ungroup(id)
gps_pts <- gps_stop %>% ungroup(id) %>%
add_row(first_gps) %>% group_by(id) %>% arrange(timestamp) %>%
filter(!(start_vec==1 & stop_vec==1)) %>%
group_by(id) %>% arrange(timestamp) %>%
mutate( start.time = ifelse(start_vec== 0 & stop_vec==0, timestamp, NA),
start.time = ifelse(start_vec==1, timestamp,NA),
end.time=ifelse(stop_vec==1, timestamp, NA),
start.gps = ifelse(start_vec==0 & stop_vec==0, geometry,NA),
start.gps = ifelse(start_vec==1, geometry,NA),
end.gps=ifelse(stop_vec==1, geometry,NA),
end.time = ifelse(start_vec==1, lead(end.time), end.time),
end.gps = ifelse(start_vec==1, lead(end.gps), end.gps)) %>%
filter(!is.na(start.time))%>%
mutate(end.gps = ifelse(end.gps=='NULL',start.gps,end.gps),
end.time = ifelse(is.na(end.time),start.time, end.time),
start.time= as_datetime(start.time),
end.time=as_datetime(end.time),
next.start.time=lead(start.time),
driving.time=round(difftime(end.time,start.time,units='mins'),2)) %>%
dplyr::select(id, date, start.time, end.time, start.gps, end.gps,
next.start.time, driving.time) %>%
mutate(start.gps=purrr::map(start.gps, st_point) %>% st_as_sfc(crs=4326))%>%
mutate(end.gps=purrr::map(end.gps, st_point) %>% st_as_sfc(crs=4326))
car$CarID <- as_factor(car$CarID)
gps_pts <- left_join(gps_pts, car, by=c("id"="CarID"))
gps_stop_points1 <- gps_pts %>%
mutate(time.stop = difftime(next.start.time, end.time, units=c("mins")),
time.stop = as.numeric(time.stop))%>%
filter(time.stop < 300) %>%
dplyr::select(id, start.time, start.gps)
## Generate map with the stop positions in blue dots
tmap_mode("view")
map_POI<-tm_shape(bgmap) +
tm_rgb(bgmap, r=1, g=2, b=3, alpha=NA, saturation=1,
interpolate=TRUE, max.value=255) +
tm_shape(gps_stop_points1)+
tm_dots(col="blue", shape=30,id="id",
popup.vars=c("Car ID"="id",
"Stationary timestamp" = "start.time",
"GPS:"="start.gps"))
tmap_leaflet(map_POI)
Figure 9: GPS stationary locations
The car id are triangulated by tabulating the centroid coordinates of the GPS data from the stationary GPS stop locations from the map. However, there are few limitations by using the methodology mentioned earlier for tagging the owners.
The interactive heatmap in Figure 10 shows the percentage that were successfully match with the car GPS and cc transaction data by the conditions mentioned earlier. The histogram was also plotted to visualise the distribution of the result. From the two visualisation, we observed that the methodology yield some high percentage match for the car id owner with the cc owner.
# Tagging location coordinates
location_tag <- data.frame(location = c(locations$location,"GAStech"),
long =c(centroid(rbind(c(24.82590612, 36.05102229),c(24.82591819, 36.05092013),c(24.82598413, 36.05097547)))[1],
centroid(rbind(c(24.84592966, 36.07443715),c(24.84598782, 36.07434876),c(24.84595026, 36.07437836)))[1],
centroid(rbind(c(24.85097804, 36.06349268),c(24.85099445, 36.06342076),c(24.85103178, 36.06348173)))[1],
centroid(rbind(c(24.87617634, 36.07713037),c(24.87621582, 36.07713598),c(24.87619872, 36.07715385)))[1],
centroid(rbind(c(24.85626503, 36.07529323),c(24.85631411, 36.07523202),c(24.85634841, 36.07528136)))[1],
centroid(rbind(c(24.85089145, 36.08172086),c(24.85096025, 36.08176242),c(24.85087799, 36.08180554)))[1],
centroid(rbind(c(24.90119998, 36.05402165),c(24.90128202, 36.05408823),c(24.90116585, 36.05411015)))[1],
NA,
centroid(rbind(c(24.88089399, 36.05851786),c(24.88092086, 36.05858619),c(24.8808655, 36.05856303)))[1],
centroid(rbind(c(24.8951996, 36.07073983),c(24.89517891, 36.07062423),c(24.89526281, 36.07069274)))[1],
centroid(rbind(c(24.88983886, 36.05469486),c(24.88978433, 36.05463184),c(24.88977321, 36.05467589)))[1],
centroid(rbind(c(24.86416839, 36.07332041),c(24.86417651, 36.07336116),c(24.86419582, 36.07332868)))[1],
NA,
centroid(rbind(c(24.86068835, 36.08962196),c(24.86068191, 36.08954231),c(24.8607611, 36.08960361)))[1],
centroid(rbind(c(24.84132949, 36.07213193),c(24.84134818, 36.07212045),c(24.4134819, 36.07212044)))[1],
centroid(rbind(c(24.905573, 36.06044638),c(24.90561679, 36.06033304),c(24.90568587, 36.06040053)))[1],
centroid(rbind(c(24.85804364, 36.05970763),c(24.8580772, 36.05975308),c(24.8579808, 36.05976284)))[1],
centroid(rbind(c(24.85804364, 36.05970763),c(24.8580772, 36.05975308),c(24.8579808, 36.05976284)))[1],
centroid(rbind(c(24.90096913, 36.05842562),c(24.90107066, 36.05844726),c(24.90097455, 36.05850897)))[1],
centroid(rbind(c(24.88586605, 36.063639),c(24.88595361, 36.06364584),c(24.88586737, 36.06371539)))[1],
centroid(rbind(c(24.85756422, 36.07660977),c(24.85763811, 36.07664766),c(24.857573, 36.07669909)))[1],
centroid(rbind(c(24.87330651, 36.06751231),c(24.87335583, 36.06750587),c(24.87333867, 36.06755141)))[1],
centroid(rbind(c(24.85237319, 36.06582037),c(24.85241027, 36.06582475),c(24.85237372, 36.06584816)))[1],
centroid(rbind(c(24.89986767, 36.05442391),c(24.89996154, 36.05448329),c(24.89987365, 36.05453273)))[1],
centroid(rbind(c(24.84983351, 36.06587998),c(24.84983936, 36.06582196),c(24.8497762, 36.06583535)))[1],
NA,
centroid(rbind(c(24.88551872, 36.05840982),c(24.88542068, 36.0584603), c(24.88553455, 36.05844325)))[1],
centroid(rbind(c(24.83307421, 36.0653098),c(24.83314028, 36.06523446), c(24.84143955, 36.06403449),c(24.84141463, 36.06410072)))[1],
NA,
centroid(rbind(c(24.87077341, 36.05196196),c(24.87081903, 36.05192066),c(24.87083665, 36.05197804)))[1],
centroid(rbind(c(24.85227441, 36.06324941),c(24.85226894, 36.06330479),c(24.8523291, 36.0632684)))[1],
NA,NA,
centroid(rbind(c(24.87148791, 36.06774029),c(24.8714995, 36.06774623),c(24.87149104, 36.06776587)))[1],
centroid(rbind(c(24.87956897, 36.04802112),c(24.8795714, 36.04804908), c(24.8795745, 36.0480309)))[1]),
lat = c(centroid(rbind(c(24.82590612, 36.05102229),c(24.82591819, 36.05092013),c(24.82598413, 36.05097547)))[2],
centroid(rbind(c(24.84592966, 36.07443715),c(24.84598782, 36.07434876),c(24.84595026, 36.07437836)))[2],
centroid(rbind(c(24.85097804, 36.06349268),c(24.85099445, 36.06342076),c(24.85103178, 36.06348173)))[2],
centroid(rbind(c(24.87617634, 36.07713037),c(24.87621582, 36.07713598),c(24.87619872, 36.07715385)))[2],
centroid(rbind(c(24.85626503, 36.07529323),c(24.85631411, 36.07523202),c(24.85634841, 36.07528136)))[2],
centroid(rbind(c(24.85089145, 36.08172086),c(24.85096025, 36.08176242),c(24.85087799, 36.08180554)))[2],
centroid(rbind(c(24.90119998, 36.05402165),c(24.90128202, 36.05408823),c(24.90116585, 36.05411015)))[2],
NA,
centroid(rbind(c(24.88089399, 36.05851786),c(24.88092086, 36.05858619),c(24.8808655, 36.05856303)))[2],
centroid(rbind(c(24.8951996, 36.07073983),c(24.89517891, 36.07062423),c(24.89526281, 36.07069274)))[2],
centroid(rbind(c(24.88983886, 36.05469486),c(24.88978433, 36.05463184),c(24.88977321, 36.05467589)))[2],
centroid(rbind(c(24.86416839, 36.07332041),c(24.86417651, 36.07336116),c(24.86419582, 36.07332868)))[2],
NA,
centroid(rbind(c(24.86068835, 36.08962196),c(24.86068191, 36.08954231),c(24.8607611, 36.08960361)))[2],
centroid(rbind(c(24.84132949, 36.07213193),c(24.84134818, 36.07212045),c(24.4134819, 36.07212044)))[2],
centroid(rbind(c(24.905573, 36.06044638),c(24.90561679, 36.06033304),c(24.90568587, 36.06040053)))[2],
centroid(rbind(c(24.85804364, 36.05970763),c(24.8580772, 36.05975308),c(24.8579808, 36.05976284)))[2],
centroid(rbind(c(24.85804364, 36.05970763),c(24.8580772, 36.05975308),c(24.8579808, 36.05976284)))[2],
centroid(rbind(c(24.90096913, 36.05842562),c(24.90107066, 36.05844726),c(24.90097455, 36.05850897)))[2],
centroid(rbind(c(24.88586605, 36.063639),c(24.88595361, 36.06364584),c(24.88586737, 36.06371539)))[2],
centroid(rbind(c(24.85756422, 36.07660977),c(24.85763811, 36.07664766),c(24.857573, 36.07669909)))[2],
centroid(rbind(c(24.87330651, 36.06751231),c(24.87335583, 36.06750587),c(24.87333867, 36.06755141)))[2],
centroid(rbind(c(24.85237319, 36.06582037), c(24.85241027, 36.06582475),c(24.85237372, 36.06584816)))[2],
centroid(rbind(c(24.89986767, 36.05442391),c(24.89996154, 36.05448329), c(24.89987365, 36.05453273)))[2],
centroid(rbind(c(24.84983351, 36.06587998),c(24.84983936, 36.06582196),c(24.8497762, 36.06583535)))[2],
NA,
centroid(rbind(c(24.83307421, 36.0653098),c(24.83314028, 36.06523446), c(24.84143955, 36.06403449),c(24.84141463, 36.06410072)))[1],
centroid(rbind(c(24.88551872, 36.05840982),c(24.88542068, 36.0584603), c(24.88553455, 36.05844325)))[2],
NA,
centroid(rbind(c(24.87077341, 36.05196196),c(24.87081903, 36.05192066),c(24.87083665, 36.05197804)))[2],
centroid(rbind(c(24.85227441, 36.06324941),c(24.85226894, 36.06330479),c(24.8523291, 36.0632684)))[2],
NA,NA,
centroid(rbind(c(24.87148791, 36.06774029),c(24.8714995, 36.06774623),c(24.87149104, 36.06776587)))[2],
centroid(rbind(c(24.87956897, 36.04802112),c(24.8795714, 36.04804908), c(24.8795745, 36.0480309)))[2]))
location_tag <- location_tag %>% na.omit()
location_tag <- st_as_sf(location_tag, coords=c("long","lat"), crs=4326)
## join GPS data with transaction data with location coordinates
final_trans_gps <- inner_join(final_trans_1, location_tag, by=c("location")) %>%
rename(loc.coord=geometry)
## Join with car GPS and tag the location to car gps
gps_match <- final_trans_gps %>%
left_join(gps_pts , by=c("date"))%>%
group_by(last4ccnum) %>% arrange(datetime) %>%
filter(datetime > end.time & datetime <= next.start.time + minutes(30)) %>%
mutate(diff.dist = st_distance(loc.coord, end.gps, by_element=TRUE),
diff.dist = as.numeric(diff.dist)) %>%
filter(diff.dist <500)
tagging <-gps_match %>%group_by(last4ccnum, id)%>%
summarize(tag=n()) %>% arrange(desc(tag))
## Get total count of transactions minus the 4 locations per cc num
trans_collapse <- cc %>%
filter(!(location %in% c("Bean There Done That",
"Brewed Awakenings",
"Coffee Shack",
"Jack's Magical Beans"))) %>%
group_by(last4ccnum) %>% summarize(total=n())
## Limit to top 3 match only by percentage
tagging_cc_gps <- left_join(tagging, trans_collapse, by=c("last4ccnum")) %>%
mutate(percent=round(tag/total*100,2))
tag_plot<-ggplot(tagging_cc_gps, aes(x=id, y=last4ccnum,fill=percent))+
geom_tile() + scale_fill_gradient(low="sienna1", high="navyblue") +
xlab("Car ID") +ylab("CC last 4 number")+
labs(fill="% match")
histogram<-ggplot(tagging_cc_gps,aes(percent))+geom_histogram(binwidth=5)+
stat_function(fun=dnorm,aes(color="red"),
args=list(mean=mean(tagging_cc_gps$percent),
sd=sd(tagging_cc_gps$percent)))
ggplotly(tag_plot) %>% layout(hoverlabel=list(bgcolor="white"))
Figure 10: Car GPS tagging to CC number
Figure 10: Car GPS tagging to CC number
Hence, we can confidently infer that matches over 75% will be accurate. However, as there are more cc owners (55 unique owners) than car owners (35 unique car id) and the truck drivers share vehicles (5 unique truck id), we will drop the truck drivers with car id of 100 and above. Observation of the heatmap in figure 10 reveals that car id 23, car id 29 and car id 30 has matches of more than one cc number and car id 28 does not have a match with more than 75%.
From Table 6, we observe that car id 23 matches to three unique cc number with matches over 75%. The highest percentage match to cc 3484 at 91.43% shows high probability for inference, hence the observation that matches to cc 8202 and 8411 will be dropped.
For car id 29 and 30, the matches to cc number percentage are relatively high and defers less than 10%. Further investigation on the GPS map location will be performed to verify which match to retain.
## Get the match of car id to cc last4ccnum
tagging <- tagging_cc_gps %>% mutate(id=as.character(id), id=as.numeric(id)) %>%
filter(percent>=75 & id<100)
knitr::kable(tagging %>% filter(id==23 | id==29 | id==30) %>%
arrange(id), "simple",
caption="Table of employees record and their cc and loyalty number")
| last4ccnum | id | tag | total | percent |
|---|---|---|---|---|
| 3484 | 23 | 31 | 35 | 88.57 |
| 8202 | 23 | 25 | 33 | 75.76 |
| 8411 | 23 | 25 | 32 | 78.12 |
| 3547 | 29 | 18 | 20 | 90.00 |
| 5921 | 29 | 13 | 14 | 92.86 |
| 6901 | 30 | 31 | 37 | 83.78 |
| 8202 | 30 | 25 | 33 | 75.76 |
final_tagging <- tagging %>%
filter(!(last4ccnum==8202 & id==23), !(last4ccnum==8411 & id ==23))
Investigation of car id 28 low cc transactions matches was visualised in Figure 11 and it revealed that the GPS coordinates of car id 28 has lots of noise. The noise in the GPS line caused a wider spread of GPS line in the visualisation on the map and also zig-zag incoherent GPS path. This most probably signifies a faulty GPS signal on the car.
Secondly, we observe that the stop position was not accurate. For example, the frequency of GPS stop coordinates at the extreme south of the map should be at GAStech. Hence, the GPS stop coordinates seems to deviate in the North-West direction. The most probable explanation will be a faulty GPS system since the GPS points were noisy and were not correctly geo-referenced on the map.
## Map geometry for original car id 28 data
gps_path5 <- gps_sf %>%
filter(id==28) %>%
group_by(id) %>%
summarize(m = mean(timestamp), do_union=FALSE) %>% st_cast("LINESTRING")
gps_28_points <- gps_stop %>% filter(id ==28)
## Plot interactive map
tmap_mode("view")
map5<-tm_shape(bgmap) +
tm_rgb(bgmap, r=1, g=2, b=3, alpha=NA, saturation=1,
interpolate=TRUE, max.value=255) +
tm_shape(gps_path5)+
tm_lines() +
tm_shape(gps_28_points)+
tm_dots(col="blue")
tmap_leaflet(map5)
Figure 11: Original GPS for car id 28
After re-calibrating the GPS coordinates for car id 28, Figure ?? shows the GPS movement data for car id 28. With the re-calibrated GPS data, we would match it with the cc transaction data to infer which cc belongs to car id 28.
From the map in Figure ??, the unqiue observation was that car id 28 visited Ahaggo Museum on the 18th and 19th of Jan and frequently patronise Jack’s Magical Beans and Ouzeri Elian over the two weeks.
From the cc transaction table, a search of Ahaggo Museum revealed that cc 1286, 7384 and 9241 made transactions on the 18th and 19th of Jan. Next, a search of Jack’s Magical Beans shows that only cc 9241 out of the three cc made transactions at the location. Lastly, a search of Ouzeri Elian on the datatable reveals that cc 9241 made 6 transactions at the location. Hence, we are confident to infer that car id 28 is the owner of cc 9241.